nlp_architect.models.np2vec.NP2vec

class nlp_architect.models.np2vec.NP2vec(corpus, corpus_format='txt', mark_char='_', word_embedding_type='word2vec', sg=0, size=100, window=10, alpha=0.025, min_alpha=0.0001, min_count=5, sample=1e-05, workers=20, hs=0, negative=25, cbow_mean=1, iterations=15, min_n=3, max_n=6, word_ngrams=1, prune_non_np=True)

Initialize the np2vec model, train it, save it and load it.

__init__(corpus, corpus_format='txt', mark_char='_', word_embedding_type='word2vec', sg=0, size=100, window=10, alpha=0.025, min_alpha=0.0001, min_count=5, sample=1e-05, workers=20, hs=0, negative=25, cbow_mean=1, iterations=15, min_n=3, max_n=6, word_ngrams=1, prune_non_np=True)

Initialize np2vec model and train it.

Parameters:
  • corpus (str) – path to the corpus.
  • corpus_format (str {json,txt,conll2000}) – format of the input marked corpus; txt and json formats are supported. For json format, the file should contain an iterable of sentences. Each sentence is a list of terms to be used for training.
  • mark_char (char) – special character that marks NP’s suffix.
  • word_embedding_type (str {word2vec,fasttext}) – word embedding model type; word2vec and fasttext are supported.
  • np2vec_model_file (str) – path to the file where the trained np2vec model is to be stored.
  • binary (bool) – boolean indicating whether the model is stored in binary format; if word_embedding_type is fasttext and word_ngrams is 1, binary should be set to True.
  • sg (int {0,1}) – model training hyperparameter, skip-gram. Defines the training algorithm. If 1, skip-gram is used; otherwise, CBOW is employed.
  • size (int) – model training hyperparameter, size of the feature vectors.
  • window (int) – model training hyperparameter, maximum distance between the current and predicted word within a sentence.
  • alpha (float) – model training hyperparameter. The initial learning rate.
  • min_alpha (float) – model training hyperparameter. Learning rate will linearly drop to min_alpha as training progresses.
  • min_count (int) – model training hyperparameter, ignore all words with total frequency lower than this.
  • sample (float) – model training hyperparameter, threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers (int) – model training hyperparameter, number of worker threads.
  • hs (int {0,1}) – model training hyperparameter, hierarchical softmax. If set to 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
  • negative (int) – model training hyperparameter, negative sampling. If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20).
  • cbow_mean (int {0,1}) – model training hyperparameter. If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when CBOW is used.
  • iterations (int) – model training hyperparameter, number of iterations.
  • min_n (int) – fasttext training hyperparameter. Min length of char ngrams to be used for training word representations.
  • max_n (int) – fasttext training hyperparameter. Max length of char ngrams to be used for training word representations. Set max_n to be less than min_n to avoid char ngrams being used.
  • word_ngrams (int {0,1}) – fasttext training hyperparameter. If 1, enriches word vectors with subword (ngrams) information.
  • prune_non_np (bool) – indicates whether to prune non-NP’s after training process.
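
The json corpus format described above (an iterable of sentences, each a list of terms) can be sketched with the standard library alone. In this minimal sketch, the helper `mark_np`, the sample sentences, and the convention of joining a multiword noun phrase with the mark character before appending it as a suffix are all assumptions made for illustration; the source only states that `mark_char` marks the NP's suffix.

```python
import json
import os
import tempfile

MARK_CHAR = "_"  # same value as the documented default for mark_char


def mark_np(words, mark_char=MARK_CHAR):
    """Hypothetical helper: join the words of a noun phrase and append the mark character."""
    return mark_char.join(words) + mark_char


# Each sentence is a list of terms; noun phrases carry the mark character as a suffix.
sentences = [
    [mark_np(["deep", "learning"]), "improves", mark_np(["machine", "translation"])],
    ["we", "train", mark_np(["word", "embeddings"])],
]

# Write the corpus as a JSON list of lists, then read it back to check the round trip.
corpus_path = os.path.join(tempfile.mkdtemp(), "marked_corpus.json")
with open(corpus_path, "w", encoding="utf-8") as f:
    json.dump(sentences, f)

with open(corpus_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[0])
```

A file written this way would presumably be passed as `corpus` together with `corpus_format='json'`; whether the library expects exactly this marking convention should be checked against its corpus-preparation tooling.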

Methods

__init__(corpus[, corpus_format, mark_char, …]) Initialize np2vec model and train it.
is_marked(s) Check if a string is marked.
load(np2vec_model_file[, binary, …]) Load the np2vec model.
save([np2vec_model_file, binary, …]) Save the np2vec model.
is_marked(s)

Check if a string is marked.

Parameters:
  • s (str) – string to check
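
Since `mark_char` is documented as the character that marks an NP's suffix, the check can plausibly be pictured as a suffix test. The standalone function below is a sketch under that assumption and not the library's implementation; it takes `mark_char` as an argument instead of reading it from a model instance.

```python
def is_marked(s, mark_char="_"):
    """Return True if the string ends with the mark character (assumed behaviour)."""
    return s.endswith(mark_char)


tokens = ["deep_learning_", "improves", "snake_case"]

# Only the term whose last character is the mark character counts as marked.
marked = [t for t in tokens if is_marked(t)]
print(marked)
```

Under this reading, a term such as `snake_case` that merely contains the mark character is not treated as a noun phrase.
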
classmethod load(np2vec_model_file, binary=False, word_ngrams=0, word2vec_format=True)

Load the np2vec model.

Parameters:
  • np2vec_model_file (str) – the file containing the np2vec model to load
  • binary (bool) – boolean indicating whether the np2vec model to load is in binary format
  • word_ngrams (int {1,0}) – If 1, np2vec model to load uses word vectors with subword (ngrams) information.
  • word2vec_format (bool) – boolean indicating whether the model to load has been stored in original word2vec format.
Returns:

the loaded np2vec model
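
To make the `word2vec_format=True, binary=False` case more concrete, the sketch below parses the original word2vec text layout: a header line with the vocabulary size and vector dimension, followed by one line per term holding the term and its vector components. The function `read_word2vec_text` and the sample vectors are hypothetical and only illustrate the file layout; they do not reproduce how the library itself loads a model.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec text format into a dict mapping each term to a list of floats."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines at the end of the file
        if len(parts) != dim + 1:
            raise ValueError("unexpected number of fields: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    if len(vectors) != vocab_size:
        raise ValueError("header announces %d terms, found %d" % (vocab_size, len(vectors)))
    return vectors


# Two marked noun phrases with made-up 3-dimensional vectors.
sample = io.StringIO(
    "2 3\n"
    "deep_learning_ 0.1 0.2 0.3\n"
    "machine_translation_ 0.4 0.5 0.6\n"
)
vectors = read_word2vec_text(sample)
print(sorted(vectors))
```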

save(np2vec_model_file='np2vec.model', binary=False, word2vec_format=True)

Save the np2vec model.

Parameters:
  • np2vec_model_file (str) – the file in which to save the np2vec model
  • binary (bool) – boolean indicating whether to save the np2vec model in binary format
  • word2vec_format (bool) – boolean indicating whether to save the model in original word2vec format.
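
Putting the documented signatures together, a train, save, and load round trip might look like the sketch below. The wrapper `train_save_load`, the chosen hyperparameter values, and the file names are assumptions for illustration; only the keyword arguments shown in the signatures above are used. The import is placed inside the function so the sketch can be read and imported without nlp_architect installed.

```python
def train_save_load(corpus_path, model_path="np2vec.model"):
    """Hypothetical wrapper: train an np2vec model, save it, and load it back."""
    # Lazy import: nlp_architect is only needed when the function is actually called.
    from nlp_architect.models.np2vec import NP2vec

    # The constructor both initializes and trains the model.
    model = NP2vec(
        corpus_path,
        corpus_format="json",
        mark_char="_",
        word_embedding_type="word2vec",
        size=100,
        window=10,
        min_count=5,
        workers=4,
    )

    # Use the same format flags for saving and loading so the file can be read back.
    model.save(model_path, binary=False, word2vec_format=True)
    return NP2vec.load(model_path, binary=False, word_ngrams=0, word2vec_format=True)
```

Keeping `binary` and `word2vec_format` identical in `save` and `load` is the main point here; for a fasttext model with `word_ngrams=1`, the parameter notes above suggest `binary=True` instead.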